ABSTRACT


Supercomputers capable of performing extremely high
speed computation have been proposed which are based on an
architecture known as data flow. A Petri net-based
methodology is applied to evaluate the performance
attainable by such an architecture. The architecture
evaluated is MIT's cell block data flow architecture, which
is being developed to execute the applicative programming
language VAL.

Results show that for the data flow architecture to
achieve its goal of high speed computation, intelligent
multiprogramming schemes need to be developed. One such
scheme, based on the notion of a "concurrency vector", is
introduced.


I. INTRODUCTION


A.  BACKGROUND 

Despite the orders-of-magnitude increase in computation
speed that has occurred since the early 1950's, the need
still exists today for faster computers. This need is most
critical in the area of scientific computing, where there
exist computations requiring on the order of a billion
floating point operations per second [DENNIS, 1980].

One approach to achieving higher computation speed is to
increase the speed of the basic logic devices of the
computer. This approach, effective in the past, faces
significant obstacles to future gains because of the speed
of light limitation to signal propagation and limitations in
the integrated circuit manufacturing process.

A second approach to achieving higher computation speed
is through the exploitation of parallelism which is (or can
be) inherent in algorithms used to solve a wide range of
scientific problems. Such parallelism can be present at both
the operation and procedure levels in a program. Thus far,
such exploitation of parallelism has not reached a limiting
threshold to faster computation.


Data flow computing has been proposed as a conceptually
viable method of achieving higher computational speed
through greater exploitation of inherent algorithmic
parallelism. A computer based on the data flow concept
executes an instruction when its operands become available.
No sequential control flow notion exists. Data flow programs
are free of sequencing constraints except those imposed by
the flow of operands between instructions. Thus, a data flow
computer contrasts fundamentally with the "von Neumann"
model. Even so, the data flow concept is capable of
incorporating into one system all the known forms of
parallelism exploitation including vectorization,
pipelining, and multiprocessing.

B.  RESEARCH  APPROACH 

It was the purpose of this research to gain insights
into the degree of parallelism exploitation obtainable with
the data flow-based high speed computation method. The
classical issues of hardware utilization, program execution
time, and "degree" of multiprogramming were investigated in
the context of data flow. These insights were gained by
applying an existing Petri net-based methodology.

The hypothesis for this research had two parts. First,
the suitability of the Petri net-based Requester-Server
methodology for prediction of the performance of data flow
machines in an efficient, accurate manner was to be
explored. Second, a challenge to the data flow concept was
made. It was hypothesized that the goal of achieving higher
speed computation through data flow computing is
unattainable without achieving a high and "intelligent"
degree of multiprogramming. By "intelligent" it is meant
that the mapping of processes onto the hardware shall have
to be done in a near optimal fashion, defined in terms of
hardware utilization and program execution time.

To explore the two-part hypothesis, sets of Petri net
models of data flow programs, characterized by a range of
inherent parallelism, were "executed" on Petri net models of
the Dennis-Misunas data flow hardware design [DENNIS, AUG
1974], using the methodology called Requester-Server [COX,
1978]. The hardware models were varied in the number of
processing elements available for use in "executing" the
sets of program models. Thus software models were "run" on
hardware models, and appropriate performance indices were
measured and analyzed.

This research is important because it suggests a method
for mapping data flow programs onto the data flow machine to
achieve the desired degree of high speed computation.
Additionally, the Requester-Server (R-S) methodology has
been shown to be an effective tool for predicting the
performance of data flow computer architectures.

Grateful acknowledgment is made to L. A. Cox, designer
and initial implementer of the R-S methodology, and to D. M.
Stowers, who modified the R-S software to enable it to run
on the PDP-11/50 minicomputer at NPS.

C. ORGANIZATION


The results reported here are organized in a fashion
conducive to the communication of experimental computer
science endeavors. Following a review of the applicable
literature (Section II), the hypothesis (Section III) is
presented. Next, the method used to test the hypothesis is
presented in detail (Section IV). This section discusses the
experimental design which includes the identification of
independent and dependent variables, characterizes and
explains the Petri net definition of the data flow hardware
and software, and ends with an account of the procedure used
to implement the experiment to test the hypothesis. Results
of the experiment and a discussion thereof are covered in
Section V. This section, in addition to demonstrating the
suitability of the R-S technique, and exploring the
multiprogramming response of data flow architectures,
presents some unexpected findings. Section VI summarizes the
entire research effort, including the results. Finally,
Section VII presents recommendations for further
investigation in the area of data flow research.


II. LITERATURE REVIEW


A.  APPROACHES  TO  PARALLELISM 

In general, computer science literature approaches the
concept of parallelism exploitation from either an
architecture (hardware) or language (software) point of
view. In this thesis, "parallelism" shall be viewed as
existing at many hierarchical levels within algorithms. Any
of several different computer architectures may be capable
of exploiting the parallelism which exists at one or more of
these various hierarchical levels. As should be expected,
each architecture is best suited to exploiting inherent
algorithmic parallelism at a particular hierarchical level,
but not at others. In contrast, implementation of the data
flow concept proposes to exploit inherent algorithmic
parallelism at all hierarchical levels, in an efficient
fashion. Before presenting the concept of data flow, a
review of the range of architectures and strategies
currently used to exploit parallelism is presented.

In an early study of high speed computer architectures
[FLYNN, 1966] a four element taxonomy was developed which
classified computer systems in terms of the amount of
parallelism in their instruction streams and data streams
(see figure II.A.1). (An instruction stream is the series of
operations used by the processor; a data stream is the
series of operands used by the processor.) The first element
of this taxonomy, depicted in figure II.A.1(a), is the
serial computer which executes one instruction at a time,
affecting at most one data item at a time. Such a serial
machine is denoted as a single-instruction
single-data-stream (SISD) computer. The SISD computer can be
characterized as possessing no capability for exploitation
of algorithmic parallelism. The three remaining computer
system organizations within the Flynn taxonomy do possess
capabilities for exploiting algorithmic parallelism.

By allowing more than one data stream, a
single-instruction multiple-data-stream (SIMD) computer
results, as shown in figure II.A.1(b). This organization
allows vectorization and is known as a vector or array
processor because each instruction operates on a data vector
during each instruction cycle, rather than on just one
operand. The model in figure II.A.1(b) shows N processors
each accepting as input its own data stream. It is
noteworthy that each of the N processors is not a standalone
serial machine (SISD computer) because the N processors take
the same instruction from an external control unit at each
time step.

If the SISD computer is extended to permit more than one
instruction stream, the multiple-instruction
single-data-stream (MISD) computer shown in figure II.A.1(c)
results. This computer system organization within Flynn's
taxonomy is present for completeness but has yet to be shown
to possess much utility. An example of such a machine would
be one built to generate tables of functions (such as
squares and square roots) of a stream of numbers. Each
processor would perform a different function on the same
data item at each time step.

a) Model of an SISD computer
b) Model of an SIMD computer
c) Model of an MISD computer
d) Model of an MIMD computer

Figure II.A.1: COMPUTER MODELS [STONE, 1980]

The fourth and final element of Flynn's taxonomy is one
which possesses parallelism in both the instruction and data
streams. This multiple-instruction multiple-data-stream
(MIMD) computer (shown in figure II.A.1(d)) is made up of N
complete SISD machines which are interconnected for
communication purposes. Such a parallel architecture is more
readily recognized as a multiprocessor in which as many as N
processors can be performing useful work at the same time.
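
The four categories of Flynn's taxonomy follow mechanically
from whether a machine has one or several instruction
streams and one or several data streams. A minimal sketch in
Python makes the classification explicit; the function name
flynn_class is purely illustrative and is not taken from the
cited literature.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category for the given numbers of streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a computer needs at least one stream of each kind")
    i = "S" if instruction_streams == 1 else "M"   # single or multiple instruction
    d = "S" if data_streams == 1 else "M"          # single or multiple data
    return f"{i}I{d}D"


# One control unit driving 8 processing elements: an array processor.
print(flynn_class(1, 8))   # SIMD
# Four complete processors, each with its own program and data: a multiprocessor.
print(flynn_class(4, 4))   # MIMD
```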

Beyond Flynn's taxonomy are other approaches to
parallelism. The first, pipelining, is a strategy which
makes use of the fact that a processor, in executing an
instruction, actually performs a sequence of functions in
various functional units of the processor. Each function is
performed at a different stage along the pipeline. Figure
II.A.2 shows a processor with a simple pipeline design.
Rather than waiting for each instruction to be completely
executed before beginning the next instruction, the pipeline
processor begins execution of the next instruction as soon
as functional units at the beginning of the pipeline are
available. Thus, the pipeline is normally full, containing
more than one instruction in various stages of execution.
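
The benefit of keeping the pipeline full can be illustrated
with a small timing sketch. It assumes an idealized pipeline
that is not described in the text: every stage takes exactly
one time step and no instruction ever stalls. Under those
assumptions, n instructions on a k-stage pipeline need
k + (n - 1) steps instead of n * k. All function names are
illustrative.

```python
def serial_time(n_instructions: int, n_stages: int) -> int:
    """Steps needed when each instruction finishes before the next begins."""
    return n_instructions * n_stages


def pipelined_time(n_instructions: int, n_stages: int) -> int:
    """Steps needed when a new instruction enters the pipeline every step."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def pipeline_occupancy(n_instructions: int, n_stages: int):
    """For each time step, list the (instruction, stage) pairs in progress."""
    steps = []
    for t in range(pipelined_time(n_instructions, n_stages)):
        # Instruction i enters stage 0 at step i, so at step t it is in stage t - i.
        active = [(i, t - i) for i in range(n_instructions) if 0 <= t - i < n_stages]
        steps.append(active)
    return steps


print(serial_time(10, 4), pipelined_time(10, 4))   # 40 13
for t, active in enumerate(pipeline_occupancy(3, 4)):
    print(t, active)
```

Once the pipeline has filled, each step holds as many
instructions as there are stages, which is the sense in
which the text calls the pipeline "normally full".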


The final approach to parallelism to be presented is the
strategy of overlapping. In the traditional sense,
overlapping within a computer system occurs when the central
processing unit (CPU) is allowed to function concurrently
with input/output (I/O) operations. Such concurrency was
prevented in early computers because I/O operations required
data paths to memory which ran through CPU registers,
preventing CPU functions from occurring while performing
I/O. Overlapping can occur in other ways within a computer
but the example given is sufficient to convey the general
idea.

The techniques for exploiting algorithmic parallelism
that have been presented are not all mutually exclusive. For
example, the strategies of pipelining and overlapping can be
included in any of the four architectures. Furthermore,
other more complex machine organizations have been proposed.
One example is the multiple SIMD (MSIMD) machine which
consists of more than one control unit sharing a pool of
processors through a switching network [HWANG, 1979]. Such
hybrids will not be considered further.

Having described the major architectural approaches to
exploiting algorithmic parallelism, it is appropriate to
characterize the problems for which each method is suitable
and to present some of the difficulties that still exist in
using each method. By "suitable" it is meant that the method
allows the processing of a problem in such a manner that
some speedup in execution time is achieved in comparison
with what the execution time would be for the problem run on
a serial machine.

The main implementation of the SIMD architecture, the
array processor, is suitable for computations which can be
described by vector instructions. Also, operands processed
simultaneously must be capable of being fetched
simultaneously from memory. Finally, processor
interconnections must support high speed data routing
between processors. If any of the above conditions is not
met, then the computation may execute in a predominantly
serial fashion within this SIMD computer. Because of these
required conditions, the array processor is generally
considered to be a specialized, rather than general
purpose, machine.

As previously mentioned, the MISD architecture exists
merely to complete the Flynn taxonomy and will not be
discussed further [STONE, 1980].

The dominant MIMD computer is the multiprocessor. The
multiprocessor is considered to be a general purpose
machine. Accordingly, many problems should be well-suited
for execution on such an architecture. Despite the fact that
such systems have been shown to work well in a number of
applications, especially those which consist of a number of
concurrently-processable subproblems with minimal data
sharing, numerous questions have yet to be answered. These
questions include how to best organize the parallel
computations (i.e., partition the problem) to optimize the
use of the cooperating processors, how to synchronize the
processors in the system, and how to best share the data
among system processors. Also, problems which possess an
iterative structure can run efficiently on an array
processor and avoid the overhead of synchronization and
scheduling required of the multiprocessor [STONE, 1980].

Although not considered "architectures" in the sense of
Flynn's taxonomy, both pipelining and overlapping (of which
there exist different types) are general purpose strategies
that can be applied to most problems. Like the
multiprocessor, these techniques also permit the
partitioning of a problem so that several operating hardware
pieces can function concurrently. Accordingly, pipelining
and overlapping are often considered to be forms of
multiprocessing. The difference lies in the fact that
pipelining and overlapping perform partitioning at different
hierarchical levels of a problem than does the
multiprocessing technique.

Armed with an understanding of the diverse architectural
approaches to parallelism exploitation that have been used
to date, it is logical to proceed with an alternative
approach, that of data flow. Before doing so, however, Petri
nets will be introduced. This is appropriate because the
concepts of data flow computation are a direct application
of Petri net theory.


B.  PETRI  NETS 


Petri net theory plays an important part in this
research endeavor for two reasons. First, as has been
mentioned, Petri net theory forms the basis for the concepts
used to describe and define data flow computation. Second,
Petri net theory is the basis for the Requester-Server
methodology that is used in this thesis research as a
computer performance prediction tool. Because of its
applicability, Petri net theory shall be presented herein,
in an informal manner, with emphasis placed on its use in
modelling parallel computation. Those desiring a more formal
and complete discussion of Petri nets are referred to
[PETERSON, 1977].

Petri nets may be thought of as formal, abstract models
of information flow. Their main use has been in the
modelling of systems of events in which some events may
occur concurrently but there exist constraints on the
frequency, precedence and concurrence of these events. A
Petri net graph models the static structure of a system. The
dynamic properties of a system can be represented by
"executing" the Petri net in response to the flow of
information (or occurrence of events) in the system.

The static graph of a Petri net is made up of two types
of nodes: circles (called places) which represent
conditions, and bars (called transitions) which represent
events. These nodes are connected by directed arcs running
from either places to transitions or transitions to places.
The source of a directed arc is the input, and the terminal
node is the output. The position of information in a net is
represented by markers called tokens.

The dynamic execution of a Petri net is controlled by
the position and movement of the tokens. A token moves as a
result of a transition firing. In order for a transition to
fire, it must be enabled. A transition is enabled when all
of the places which are inputs to a transition are marked
with a token. Upon transition firing, a token is removed
from each of the input places and a token is placed on each
of the output places of the transition. Thus, in modelling
the dynamic behavior of a system, the occurrence of an event
is represented by the firing of the corresponding
transition.
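
The enabling and firing rules just stated can be written
down directly. The following Python sketch represents a
marking as a dictionary from place names to token counts;
the function names and the tiny three-place net are
hypothetical and serve only to illustrate the rule.

```python
def is_enabled(marking, inputs):
    """A transition is enabled when every one of its input places holds a token."""
    return all(marking.get(place, 0) >= 1 for place in inputs)


def fire(marking, inputs, outputs):
    """Return the marking that results from firing; the argument is not modified."""
    if not is_enabled(marking, inputs):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place in inputs:       # remove one token from each input place
        new_marking[place] -= 1
    for place in outputs:      # deposit one token on each output place
        new_marking[place] = new_marking.get(place, 0) + 1
    return new_marking


# Hypothetical net: one transition with input places A and B and output place C.
marking = {"A": 1, "B": 1, "C": 0}
after = fire(marking, inputs=["A", "B"], outputs=["C"])
print(after)   # {'A': 0, 'B': 0, 'C': 1}
```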

Figures II.B.1 through II.B.4 show a Petri net at
progressive stages of execution. As can be observed, the
status of the execution at a given time can be described by
the distribution of the tokens in the net. This distribution
of tokens in a Petri net is called the net marking and
uniquely defines the state of the net for any given instant.

Petri nets are uninterpreted models. Thus, some
significance must be attached to token movement to indicate
the intent of the model. This is usually done by labelling
the nodes of a net to correspond in some way to the system
being modelled. However, it should be remembered that the
labelling of the nodes of a Petri net in no way affects its
execution. A second attribute of Petri nets is their ability
to model a system hierarchically. An entire net may be
replaced by a single node (place or transition) for
modelling at a greater level of abstraction or, conversely,
a single node may be replaced by a subnet to show greater
detail in the model.

Petri nets, as a formal graph model, are especially
useful in modelling the flow of information and control in
systems which can be characterized by asynchronous and
concurrent behavior. Figure II.B.5 shows the initial marking
of a Petri net model of such a system. Initially, transition
E1 is enabled because each of its input places, C1 and C2,
is marked with a token. Firing transition E1 removes one
token each from places C1 and C2, and puts a token into each
output place, C3 and C4. At this point in the net execution,
transition E3 is disabled because one of its input places,
C5, still has no token. Transition E2, however, is enabled,
and its firing supplies the token that C5 lacks. The net
thus models a precedence constraint, that of event E3 having
to wait until event E2 completes. Upon firing transition E3,
places C6 and C7 become marked with tokens as places C4 and
C5 lose a token each. Transitions E4 and E5 are now enabled
and can fire simultaneously, the occurrence of which
corresponds to concurrent events in a modelled system. Doing
so, that is, firing transitions E4 and E5, brings the Petri
net model back to its original (initial) configuration.
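
The execution just described can be traced mechanically. The
sketch below is an assumption-laden reconstruction rather
than a transcription of figure II.B.5: the arcs of E1 and E3
are taken from the text, while the arcs of E2, E4 and E5 are
inferred from the narrative and marked as such in the code.
At each step every enabled transition fires, which is safe
here because no two transitions share an input place.

```python
# transition: (input places, output places)
NET = {
    "E1": (("C1", "C2"), ("C3", "C4")),   # stated in the text
    "E2": (("C3",), ("C5",)),             # assumed
    "E3": (("C4", "C5"), ("C6", "C7")),   # stated in the text
    "E4": (("C6",), ("C1",)),             # assumed
    "E5": (("C7",), ("C2",)),             # assumed
}


def enabled(marking, net):
    """Names of all transitions whose input places each hold a token."""
    return [t for t, (ins, _) in net.items()
            if all(marking.get(p, 0) > 0 for p in ins)]


def step(marking, net):
    """Fire every enabled transition once and return (fired, new marking)."""
    fired = enabled(marking, net)
    new = dict(marking)
    for t in fired:
        ins, outs = net[t]
        for p in ins:
            new[p] -= 1
        for p in outs:
            new[p] = new.get(p, 0) + 1
    return fired, new


def same_marking(a, b):
    return all(a.get(p, 0) == b.get(p, 0) for p in set(a) | set(b))


def run(initial, net, max_steps=10):
    """Step the net until nothing fires or the initial marking recurs."""
    trace, marking = [], dict(initial)
    for _ in range(max_steps):
        fired, marking = step(marking, net)
        if not fired:
            break
        trace.append(fired)
        if same_marking(marking, initial):
            break
    return trace, marking


trace, final = run({"C1": 1, "C2": 1}, NET)
for number, fired in enumerate(trace, start=1):
    print(number, fired)
```

Under these assumptions E3 fires only after E2, and E4 and
E5 fire together in the last step, after which the initial
marking is restored.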

One other situation that can be represented using Petri
nets is that of conflict. Figure II.B.6 shows a net model of
such a situation. Simply, transitions E1 and E2 are both
enabled. However, if either transition fires, the remaining
transition becomes disabled. In such a case, it is an
arbitrary decision as to which one fires. Because we would
like to be able to duplicate experiments and obtain the same
results, a scheme that is often used involves simply
assigning priorities to transitions which are subject to
conflict in a net. In this way reproducible results can be
ensured. If true nondeterminism is desired in such a model,
a scheme in which probabilities are associated with each
transition can effectively model nondeterminism in the
system under study.
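
Both conflict resolution schemes can be sketched briefly.
The net below is a hypothetical stand-in for figure II.B.6:
a single place P holding one token feeds both E1 and E2. The
priority values, weights and function names are illustrative
assumptions; here a lower number means a higher priority.

```python
import random

# transition: (input places, output places); hypothetical conflict net
NET = {
    "E1": (("P",), ("A",)),
    "E2": (("P",), ("B",)),
}


def enabled(marking):
    return [t for t, (ins, _) in NET.items()
            if all(marking.get(p, 0) > 0 for p in ins)]


def fire(marking, transition):
    ins, outs = NET[transition]
    new = dict(marking)
    for p in ins:
        new[p] -= 1
    for p in outs:
        new[p] = new.get(p, 0) + 1
    return new


def pick_by_priority(candidates, priority):
    """Deterministic choice: the candidate with the smallest priority number."""
    return min(candidates, key=lambda t: priority[t])


def pick_by_probability(candidates, weights, rng):
    """Nondeterministic choice, weighted by the given probabilities."""
    return rng.choices(candidates, weights=[weights[t] for t in candidates], k=1)[0]


start = {"P": 1}
print(enabled(start))                       # ['E1', 'E2']: both are enabled

winner = pick_by_priority(enabled(start), {"E1": 2, "E2": 1})
after = fire(start, winner)
print(winner, enabled(after))               # E2 []: the loser is now disabled

rng = random.Random(0)                      # seeded so that a run can be repeated
sampled = pick_by_probability(enabled(start), {"E1": 0.5, "E2": 0.5}, rng)
print(sampled)
```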

Thus, to properly model a system with Petri nets, every
sequence of events in the modelled system should be possible
in the Petri net and every sequence of events in the Petri
net should represent a possible sequence in the modelled
system.

Figure II.B.1: MARKED PETRI NET, TIME=0

Figure II.B.2: MARKED PETRI NET, TIME=1

This section has introduced Petri nets and demonstrated
their usefulness in formally modelling information and
control flow in systems characterized by asynchronous and
concurrent behavior. Readers interested in the use of Petri
nets for performance evaluation of such systems are referred
to [RAMAMOORTHY, 1980] and [RAMCHANDANI, 1974]. The
following section shall introduce readers to the concept of
data flow which, as mentioned previously, is a direct
application of Petri net theory.

C. CONCEPT OF DATA FLOW

Data flow computing is a method of multiprocessing which
proposes to exploit inherent algorithmic parallelism at all
hierarchical levels within a program. Additional objectives
include effectively using the capabilities of LSI technology
and simplifying the programming task. The concept of
computation under data flow was derived by Dennis [DENNIS,
1974] (and a number of others working independently),
predominantly from Karp and Miller's [KARP, 1966] work on
computation graphs. This section begins by presenting the
data flow concept from the perspective of language, rather
than that of architecture. This approach is appropriate in
view of the fact that data flow computer systems are being
designed as hardware interpreters for a base language that
is fundamentally different from conventional languages. A
hardware description of the Dennis-Misunas data flow
architecture design completes this section.

In a data flow computer, an instruction is executed as
soon as its operands become available. No notion of separate
control flow exists because the data dependencies define the
flow of control in a data flow program. In fact, data flow
computers have no need for a program location counter.

This contrasts with the traditional "von Neumann"
computer architecture model which uses a global memory whose
state is altered by the sequential execution of
instructions. Such a model is limited by a "bottleneck"
between the computer control unit and the global memory
[BACKUS, 1978]. This "feature" allows conventional languages
to have side-effects, a common example of which is the
ability of a procedure to modify variables in the calling
program. Such side-effects are prohibited under the data
flow concept. Furthermore, in data flow, no variables exist,
nor are there any scope or substitution rules. In fact, the
data flow concept prohibits the modification of anything
that has a value. Rather, data flow computing takes inputs
(operands) and generates outputs (results) that have not
previously been defined. Thus, instructions in data flow are
pure functions. This is necessary so that instruction
execution can be based solely on the availability of data
(operands). Thus the data dependencies must be equivalent
to, and in fact define, the sequencing constraints in a
program. Also, to exploit parallelism at all levels, it must
be possible to derive these data dependencies from the high
level language program instructions [ACKERMAN, 1979].

A language which allows processing by means of operators
applied to values is called an applicative language. VAL
(Value-oriented Algorithmic Language) is a high level data
flow applicative language under development at MIT [ACKERMAN
and DENNIS, 1979]. It prevents any side-effects by requiring
programmers to write expressions and functions; statements
and subroutines are not allowed in the language. Because of
this constraint, most concurrency is apparent in a high
level language program written in VAL. For the purposes of
this research, no further understanding of the high level
language of data flow is required. Information about high
level language alternatives is available in [McGRAW, 1980],
[ACKERMAN and DENNIS, 1979], and [ACKERMAN, 1979].

At what would correspond to the assembly language level,
a data flow computation can be represented as a graph. The
nodes of the graph correspond to operators and the arcs
represent data paths. An arc into a node represents an input
operand path; an arc leaving a node corresponds to a result
path. Data flow graph execution occurs as operands become
available at each node. When the input arcs of a node each
have a value on them, the node can execute by removing those
values, computing the operation, and placing the results on
the output arcs (see figure II.C.1). The example data flow
graph in figure II.C.2 computes the mean and standard
deviation of its three input parameters.

    function Stats (X,Y,Z: real returns real, real)
    let
        Mean real := (X + Y + Z) / 3;
        SD real := SQRT((X^2 + Y^2 + Z^2) / 3 - Mean^2);
    in
        Mean, SD
    endlet
    endfun

Figure II.C.2: A SIMPLE STATISTICS
FUNCTION AND ITS DATA FLOW GRAPH
[McGRAW, 1980]
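
The data-driven firing rule can be illustrated with a small
Python sketch that evaluates a graph for the statistics
function. The node names and the decomposition into binary
operators are assumptions made for illustration and are not
copied from the figure. As a simplification, a value placed
on an arc may be read by several nodes and is never
overwritten, and all nodes that are ready fire together in
one "wave".

```python
import math

# node name: (operator, names of the arcs supplying its operands)
GRAPH = {
    "sum_xy":    (lambda a, b: a + b, ("X", "Y")),
    "sum_xyz":   (lambda a, b: a + b, ("sum_xy", "Z")),
    "Mean":      (lambda s: s / 3, ("sum_xyz",)),
    "sq_x":      (lambda v: v * v, ("X",)),
    "sq_y":      (lambda v: v * v, ("Y",)),
    "sq_z":      (lambda v: v * v, ("Z",)),
    "sq_sum_xy": (lambda a, b: a + b, ("sq_x", "sq_y")),
    "sq_sum":    (lambda a, b: a + b, ("sq_sum_xy", "sq_z")),
    "mean_sq":   (lambda s: s / 3, ("sq_sum",)),
    "sq_mean":   (lambda m: m * m, ("Mean",)),
    "var":       (lambda a, b: a - b, ("mean_sq", "sq_mean")),
    "SD":        (math.sqrt, ("var",)),
}


def run_dataflow(graph, inputs):
    """Fire nodes as their operands become available; return values and waves."""
    values = dict(inputs)          # arc name -> value; written once, never modified
    pending = set(graph)
    waves = []
    while pending:
        ready = sorted(n for n in pending
                       if all(arc in values for arc in graph[n][1]))
        if not ready:
            raise RuntimeError("some operands never become available")
        results = {n: graph[n][0](*(values[arc] for arc in graph[n][1]))
                   for n in ready}
        values.update(results)
        pending -= set(ready)
        waves.append(ready)
    return values, waves


values, waves = run_dataflow(GRAPH, {"X": 1.0, "Y": 2.0, "Z": 3.0})
print(values["Mean"], values["SD"])
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

No program counter appears anywhere in the sketch; the order
of execution is fixed entirely by which operands exist, and
the first wave already contains four independent operators.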

Such a graph notation is useful in illustrating the
various levels of parallelism in a program. For example, a
graph node may represent a simple operator such as addition,
or the entire statistics function of figure II.C.2. Thus the
data flow graph notation can represent parallelism existing
at the operator, function and even computation level. The
graphs execute asynchronously, nodes firing when data inputs
are available. Thus no synchronization problem exists with
regard to accessing shared data. Each data flow path can be
marked with a value by only one operator node. Once a value
is on a path, no operator can modify that value. The value
can only be read when used as an input to another node
[McGRAW, 1980].

Again, the data flow graph notation merely allows a
logical representation of a program at a level corresponding
to conventional assembly language. This logical
representation shall now be extended to permit the reader to
understand the basic data flow hardware instruction
execution mechanism. A simple example computation that shall
be used to facilitate reader understanding is the following:

    Z = (X + Y) * (X - Y)

The graph representation of this computation is shown in
figure II.C.3.

Figure II.C.4: AN ACTIVITY TEMPLATE FOR
THE ADDITION OPERATOR

Figure II.C.5: PROGRAM GRAPH USING ACTIVITY
TEMPLATES FOR THE DATA FLOW PROGRAM GRAPH OF
Figure II.C.3 [DENNIS, 1980]

In the extended graph representation scheme, a data flow
program exists as a collection of activity templates, each
template corresponding to a node in the data flow program
graph. For example, figure II.C.4 shows an activity template
for the addition operator. There are four fields in the
activity template. The first field denotes the operation
code which specifies the operation to be performed. The
second and third fields are receivers, which are locations
waiting to receive operand values. The fourth field is a
destination field which specifies where the result of the
operation on the operands is to go. There can be multiple
destination fields. Figure II.C.5 shows the program graph
representation of figure II.C.3, using activity templates.

Activity templates have been developed which control the
routing of data for such program structures as conditionals
and iterations. These templates are mentioned to point out
the fact that graph nodes can represent not only simple
operators but also more elaborate and necessary
constructs.

Some definitions which are necessary to the
understanding of the data flow instruction execution
mechanism follow. First, a data flow program instruction is
the fixed portion of an activity template and is made up of
the opcode and the destinations.

instruction:

<opcode, destinations>

Each destination field provides the address of some activity
template and an input (or offset) denoting which receiver of
the template is the target.

destination:

<address, input>

Data flow program execution occurs as follows. The
fields of a template which has been activated (by the
arrival of an operand value at each receiver) form an

operation packet:

<opcode, operands, destinations>

When the operation packet has been operated upon, a result
packet of the form

result packet:

<value, destination>

is generated for each destination field of the original
activity template. Result packet generation triggers the
placement of the value in the receiver designated by its
destination field. Thus, at a logical level, data flow
program execution occurs as a consequence of operation
packet and result packet movements through a machine
described in detail below.
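
The template and packet forms defined above can be made
concrete with a short sketch. The Python fragment below is an
illustration only and is not taken from the cited literature:
the class name Template, the opcode names ADD, SUB and MUL,
the output address "Z", and the decision to clear the
receivers once a template has fired are all assumptions
introduced here. It evaluates a computation of the form
(X+Y)*(X-Y) using three activity templates.

```python
from dataclasses import dataclass, field
import operator

# Hypothetical opcode table; the opcode names are illustrative only.
OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


@dataclass
class Template:
    """An activity template: opcode, two receivers, destinations."""
    opcode: str
    destinations: list  # list of <address, input> pairs
    receivers: list = field(default_factory=lambda: [None, None])

    def enabled(self):
        # A template is activated once every receiver holds a value.
        return all(r is not None for r in self.receivers)


def fire(template):
    """Form the operation packet, execute it, return result packets."""
    # operation packet: <opcode, operands, destinations>
    opcode = template.opcode
    operands = tuple(template.receivers)
    destinations = template.destinations
    value = OPS[opcode](*operands)
    # Assumption: receivers are emptied once the template has fired.
    template.receivers = [None, None]
    # result packet: <value, destination>, one per destination field
    return [(value, dest) for dest in destinations]


def run_example(x, y):
    """Evaluate (x + y) * (x - y) with three activity templates."""
    store = {
        0: Template("ADD", [(2, 0)], [x, y]),
        1: Template("SUB", [(2, 1)], [x, y]),
        2: Template("MUL", [("Z", 0)]),  # "Z" is a made-up output address
    }
    outputs = {}
    progress = True
    while progress:
        progress = False
        for template in store.values():
            if template.enabled():
                for value, (address, inp) in fire(template):
                    if address in store:
                        store[address].receivers[inp] = value
                    else:
                        outputs[address] = value
                progress = True
    return outputs["Z"]
```

Calling run_example(5, 3) fires the ADD and SUB templates
first; their result packets fill the two receivers of the MUL
template, which then fires in turn.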

The basic data flow instruction execution mechanism is
shown in figure II.C.6. The data flow program, consisting of
a collection of activity templates, is held in the activity
store (see figure II.C.6). Each activity template is
uniquely addressable within the activity store. When an
instruction is ready to be executed (i.e., the template is
enabled), its address is entered in the instruction queue
unit (established as a FIFO buffer).

The fetch unit is then responsible for: removing, one at
a time, instruction addresses from the instruction queue,
fetching the corresponding activity template, forming an
operation packet based on the field values in the template,
and submitting the operation packet to an operation unit for
processing. The operation unit processes the operation
packet by: performing the operation specified by the opcode
on the operands, forming result packets (one for each
destination field of the operation packet), and transmitting
the result packets to the update unit. The update unit fills
in the receivers of activity templates (designated by the
destination fields in the result packets) with the
appropriate values. The update unit is also responsible for
checking the target template to see if it has all receivers
filled, thus enabling the template. If so, the address of
the enabled template is added at the end of the instruction
queue by the update unit.
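
The division of labor among the instruction queue, fetch
unit, operation unit, and update unit can be sketched as
follows. This is a sequential illustration under stated
assumptions, not a model of the actual hardware: the function
names, the dictionary layout of a template, the opcode names,
and the output address "Z" are hypothetical, and the sketch
handles one packet at a time, so it does not show any
concurrency among the units.

```python
from collections import deque
import operator

# Hypothetical opcode table; the opcode names are illustrative only.
OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def fetch_unit(activity_store, queue):
    """Remove one address from the queue and form an operation packet."""
    address = queue.popleft()  # FIFO: oldest enabled template first
    t = activity_store[address]
    packet = (t["opcode"], tuple(t["receivers"]), t["destinations"])
    # Assumption: receivers are emptied once the template is fetched.
    t["receivers"] = [None] * len(t["receivers"])
    return packet


def operation_unit(packet):
    """Apply the opcode and form one result packet per destination."""
    opcode, operands, destinations = packet
    value = OPS[opcode](*operands)
    return [(value, dest) for dest in destinations]


def update_unit(activity_store, queue, result_packet, outputs):
    """Fill the target receiver; enqueue the target if now enabled."""
    value, (address, inp) = result_packet
    if address not in activity_store:
        outputs[address] = value  # made-up convention for final results
        return
    t = activity_store[address]
    t["receivers"][inp] = value
    if all(r is not None for r in t["receivers"]):
        queue.append(address)  # added at the end of the queue


def run(activity_store, initially_enabled):
    """Drive packets around the ring until the queue is empty."""
    queue = deque(initially_enabled)
    outputs, order = {}, []
    while queue:
        order.append(queue[0])
        packet = fetch_unit(activity_store, queue)
        for result_packet in operation_unit(packet):
            update_unit(activity_store, queue, result_packet, outputs)
    return outputs, order


def example_store(x, y):
    """Three templates computing (x + y) * (x - y)."""
    return {
        0: {"opcode": "ADD", "receivers": [x, y],
            "destinations": [(2, 0)]},
        1: {"opcode": "SUB", "receivers": [x, y],
            "destinations": [(2, 1)]},
        2: {"opcode": "MUL", "receivers": [None, None],
            "destinations": [("Z", 0)]},
    }
```

With run(example_store(5, 3), [0, 1]), templates 0 and 1 are
fetched in queue order, and template 2 is enqueued by the
update unit only after both of its receivers have been filled.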

At this point it is appropriate to discuss how and where
program parallelism can be exploited by this hardware.

"...once the fetch unit has sent an operation packet off
to the operation unit, it may immediately read another
entry from the instruction queue without waiting for the
instruction previously fetched to be completely processed.
Thus a continuous stream of operation packets may flow
from the fetch unit to the operation unit so long as the
instruction queue is not empty.

"This mechanism is aptly called a circular pipeline -
activity controlled by the flow of information packets
traverses the ring of units leftwise. A number of packets
may be flowing simultaneously in different parts of the
ring on behalf of different instructions in concurrent
execution. Thus the ring operates as a pipeline system
with all of its units actively processing packets at once.
The degree of concurrency possible is limited by the
number of units on the ring and the degree of pipelining
within each unit. Additional concurrency may be exploited
by splitting any unit in the ring into several units which
can be allocated to concurrent activities." [DENNIS,
NOV 1980]

The Dennis-Misunas data flow architecture for
implementing the described instruction execution mechanism
is called the cell block architecture and is illustrated in
figure II.C.7.

"The heart of this architecture is a large set of
instruction cells, each of which holds one activity
template of a data flow program. Result packets arrive at
instruction cells from the distribution network. Each
instruction cell sends an operation packet to the
arbitration network when all operands and signals have
been received. The function of the operation section is to
execute instructions and to forward result packets to
target instructions by way of the distribution network."
[DENNIS, NOV 1980]

Figure II.C.9 reflects a practical form of the cell
block architecture which makes use of LSI technology and
reduces the number of devices and interconnections. This
practical form is obtainable by grouping the instruction
cells of figure II.C.7 into blocks, each of which is a
single device. In this organization, several cell blocks are
serviced by a group of multifunction processing elements.
The arbitration network channels operation packets from cell
blocks to processing elements. Rather than employing a set
of processing elements each capable of a different function,
which is one design option, use of one multipurpose
processing element type is the favored approach. Such an
approach precludes the need for the arbitration network to
route operation packets according to opcode. Instead, it
simply has to forward operation packets to any available
processing element. It is this design which forms the basis
for the system model used in this research effort.

How does the basic mechanism relate to the cell block
architecture? Figure II.C.9 shows a cell block
implementation. It differs from the basic mechanism in two
ways. First, the cell block has no processing element(s)
(operation unit(s)). Second, result packets targeted for
activity templates held in the same cell block must traverse
the distribution network before being handled by the update
unit [DENNIS, NOV 1980]. This is the Dennis-Misunas data
flow architecture design. Other designs do exist; for
examples, see [GOSTELOW, 1980] and [WATSON, 1979].

Figure II.C.9: A SIMPLE CELL BLOCK IMPLEMENTATION
[DENNIS, 1980]

D.  COMPUTER  PERFORMANCE  PREDICTION 

Computer performance prediction is an evaluation process
which proposes to estimate the performance of a system not
yet in existence (i.e., in some state of design).
"Performance" simply means how well a system works. This in
turn connotes the concept of value. So, the purpose of
estimating the performance of a system under design is to
determine that system's expected value.

In order to quantify how well a system works or shall
work, performance metrics called indices are used. Typical
indices and their definitions are:

THROUGHPUT RATE - The volume of information processed
by a system in one unit of time.

HARDWARE UTILIZATION - The ratio between the time the
hardware is used during an interval
of time, and the duration of that
interval of time.

RESPONSE TIME - The elapsed time between the submission
of a program job to a system
and completion of the corresponding
job output.
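
The three indices reduce to simple ratios and differences. The
sketch below is one possible way of computing them from
recorded times; the function names and the representation of
hardware activity as a list of non-overlapping (start, end)
busy intervals are assumptions made for illustration.

```python
def throughput_rate(units_processed, elapsed_time):
    """Volume of information processed per unit of time."""
    return units_processed / elapsed_time


def hardware_utilization(busy_intervals, interval_start, interval_end):
    """Busy time within the interval divided by the interval length.

    Assumes busy_intervals is a list of non-overlapping (start, end)
    pairs; portions outside the observation interval are ignored.
    """
    busy = 0.0
    for start, end in busy_intervals:
        lo = max(start, interval_start)
        hi = min(end, interval_end)
        if hi > lo:
            busy += hi - lo
    return busy / (interval_end - interval_start)


def response_time(submitted_at, output_completed_at):
    """Elapsed time from job submission to completed job output."""
    return output_completed_at - submitted_at
```

For instance, a unit that is busy during (0, 3) and (3, 6)
over an observation interval of ten time units has a
utilization of 0.6, and a job submitted at time 0 whose output
completes at time 6 has a response time of 6.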

Computer performance prediction can be achieved via
several different techniques. Each technique has limitations
and advantages. The technique utilized in this thesis is
that of simulation. The simulation technique involves the
representation, by a model, of certain aspects of the
behavior of a system in the time domain. Observing these
aspects of the behavior in time of the system's model, under
inputs generated by a model of the system's inputs, produces
results useful in the evaluation of the modelled system
[FERRARI, 1979]. For the purposes of this research, the
aspects of behavior that are of interest are the performance
indices previously defined.

Of significant importance to any simulation effort are
the issues of validation and parameter estimation.
Conceptually, validation attempts to establish some degree
of confidence that the simulation shall produce results
which shall closely correspond with the performance of the
system under scrutiny. Parameter estimation provides the
simulation effort with hopefully credible parameter values
needed to perform a simulation having relevant results.
These issues shall be addressed in section IV.A:
Experimental Design.

The last section of the review of the literature
applicable to this research endeavor presents the
Requester-Server methodology. The Requester-Server
methodology is the "tool" used to perform the simulation
which generates the results on which the prediction of data
flow performance is based.

Readers desiring a more thorough presentation of the
subject of computer performance prediction are referred to
[FERRARI, 1979], [COX, 1978], [ALLEN, 1980], [HAMMING,
1975], [SPRAGINS, 1980], [BOZEN, 1980], and [SAUER, 1980].

E. REQUESTER-SERVER METHODOLOGY

The Requester-Server (R-S) methodology was designed and
initially implemented by L. A. Cox, Jr. [COX, 1978].
Subsequently, the Requester-Server software was modified by
D. M. Stowers [STOWERS, 1979] to run on the PDP-11/50
minicomputer at NPS. This section summarizes those portions
of [COX, 1978] and [STOWERS, 1979] which are applicable to
and necessary for the understanding of this research.

The R-S methodology is capable of predicting the
performance of computer systems characterized by
asynchronous, concurrent behavior. The methodology can
predict performance at both the computer system and computer
job levels. The R-S methodology allows the user to
separately specify the hardware configuration(s) to be
evaluated, the software (programs) to be used in evaluating
the hardware configuration(s), and the mechanism or policy
for allocating hardware resources to program requests for
service. The methodology makes provision for variable levels
of detail (in a hierarchical sense) in both the hardware and
software. Finally, the R-S methodology is capable of
simulating concurrency in both the hardware and software.
Thus, for a given hardware configuration, the control
structure mandated by the software can be mapped onto the
hardware and system performance analyzed and predicted.

The simulation process is begun by representing the
software (programs) and hardware as two separate Petri net
graphs. In the Petri net graph of the software, each arc can
be thought of as having an associated propagation delay, the
extent of which is dependent upon the hardware configuration
used to execute the program. If these delays are definable
by their correlation to the Petri net model of the hardware,
then performance values for the indices of section II.D
(Computer Performance Prediction) can be obtained by
executing the Petri net model of the software on the Petri
net hardware configuration(s). The R-S "tool" serves as the
interface between the Petri net model of system software and
Petri net model of system hardware. This interface permits
the hardware and software Petri net graphs to be constructed
separately. This is important because the control structure
and sequencing constraints of both hardware and software can
be maintained separately. This permits a direct and
meaningful representation of both the system software and
hardware being modelled.

The source file which serves as the input to the R-S
program is organized into three sections. The software
section of the input file consists of a description of the
Petri net graph representing the software program(s) to be
executed. This net graph description is formulated in terms
of the functions and constraints of the services required of
the hardware. The hardware section of the input file is made
up of a description of the computer system components and
their interconnections. This description can be given
(hierarchically) at a bit level or major component level,
depending on the system aspects under scrutiny. The Petri
net graph upon which the hardware description is based is
constructed in terms of its operation in time. The last
section of the input file, called the dynamic section,
provides the user of the R-S "tool" a place to denote system
initial conditions by defining the hardware and software
nets' token markings at the beginning of a "run". As may be
recalled from section II.B (Petri Nets), both the software
and hardware sections merely define static Petri net
structures. Performance prediction follows from the
attachment of significance to the structures and
restrictions on token movement within these structures.

The dynamic nature of Petri nets is exploited by this
R-S methodology as follows. The software net representation
makes a series of requests for the services of the hardware
net representation. Repeatedly, the R-S process maps these
requests for service onto the hardware net representation.
At each "invocation" the R-S process "runs" the hardware net
to provide the service requested by the software net. Upon
completion of each of the service requests, the R-S process
"runs" the software net representation until the hardware is
again needed. This cycle repeats itself until the software
net representation has been completely "run" and its
terminal state reached.


Events in the hardware net graph correspond to
operations in time. A collection of events is used to
represent each functional unit. Token movement through the
hardware net graph corresponds to the flow of data and
control through the modelled hardware system. A simple
hardware net description is provided in figure II.E.1.
Events in the software net graph correspond to requests for
service. As an example, an event could equate to a request
for a floating point multiplication. The flow of tokens in
the software net graph equates to the logical flow of the
algorithm, constrained by its implicit data dependencies or
sequencing constraints. A simple software net description is
provided in figure II.E.2.

Together, the software and hardware net graphs can be
executed in such a way as to simulate the operation of the
computer system for the given software workload. The
interaction of the two net graphs is orchestrated by the R-S
token arbiter. Network simulation begins with the marking of
the "BEGIN" node of the software net graph. This net graph
is then executed as would be any Petri net graph. The
arrival of a token at any place in the software net graph
indicates a request for service, at which time the R-S token
arbiter takes control. (The type of service requested is
denoted by the type of the place and is defined in the
software net description.) The R-S token arbiter removes the
token from the software net and then permits the software
net graph to continue executing until no further moves are
possible. The R-S token arbiter then initializes the
hardware functional unit (net graph denoted by the type of
service requested) by marking it with tokens. The hardware
net graph is then executed one step. Tokens reaching events
corresponding to service completion are removed, and the
token of the software net which originally caused the
request for service is replaced, by the R-S token arbiter.
Repeating this sequence of actions results in the execution
of the software net graph by the hardware net graph. A
sample input file dynamic section and the results obtained
from executing the software and hardware net graph
descriptions of figures II.E.1 and II.E.2 are presented in
figure II.E.3. Those readers interested in the
Requester-Server methodology are referred to [COX, 1978] and
[STOWERS, 1979] for a more in-depth discussion of its
capabilities and usage.
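
The request, service, and completion cycle described above can
be imitated with a very small simulation. The sketch below is
not the R-S tool and does not execute Petri nets; it is a
simplified stand-in under several assumptions: all program
events are pending at the start and are independent of one
another, every service takes a fixed number of hardware
cycles, and a "gate" count limits how many services may be in
progress at once. The function and parameter names are
hypothetical.

```python
from collections import deque


def simulate(requests, service_cycles, gate_tokens=1, max_cycles=10):
    """Return {event name: completion time} for the given requests.

    requests       -- event names, all pending at time 0 (assumption)
    service_cycles -- hardware cycles needed per service (assumption)
    gate_tokens    -- how many services may be in progress at once
    max_cycles     -- stop after this many hardware cycles
    """
    waiting = deque(requests)
    in_service = {}  # event name -> remaining hardware cycles
    completed = {}
    time = 0
    while time < max_cycles and (waiting or in_service):
        # The arbiter starts waiting requests while the gate allows it.
        while waiting and len(in_service) < gate_tokens:
            in_service[waiting.popleft()] = service_cycles
        time += 1  # execute the hardware one step
        for event in list(in_service):
            in_service[event] -= 1
            if in_service[event] == 0:
                # Service complete: control returns to the program.
                del in_service[event]
                completed[event] = time
    return completed
```

With two events, a service time of three cycles, and a gate
count of one, the second event cannot begin service until the
first has completed, so the two completions fall three cycles
apart. Raising gate_tokens to two lets both services proceed
together.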

This completes the necessary review of the literature
required to understand the research that follows. The
fundamental concepts of the various approaches to
parallelism, Petri nets, the data flow architecture,
computer performance prediction, and the Requester-Server
methodology have been reviewed. The next section presents
the two-part hypothesis which this research addresses and
tests.

Figure II.E.1: A SAMPLE HARDWARE NET GRAPH AND
INPUT FILE DESCRIPTION

Figure II.E.2: A SAMPLE SOFTWARE NET GRAPH AND
INPUT FILE DESCRIPTION


BEGIN DYNAMIC NET;

MARK GATE WITH 1;
COMMENT: GATE ENABLED TO ALLOW
         ONLY ONE OPERATION IN
         PROGRESS AT ANY TIME.

EXECUTE 10;
COMMENT: EXECUTE TEN HW CYCLES
         OR UNTIL PROGRAM IS
         COMPLETE.

END DYNAMIC NET;


S5: EXECUTE 10;

* PROGRAM EVENT J+K REQUESTS HW SVCS(1)
* PROGRAM EVENT M+J REQUESTS HW SVCS(1)

TIME = 1:
TIME = 2:
TIME = 3:
* PROGRAM EVENT J+K COMPLETES(3)
TIME = 4:
TIME = 5:
TIME = 6:
* PROGRAM EVENT M+J COMPLETES(6)

S6: END DYNAMIC NET;


Figure II.E.3: A SAMPLE INPUT FILE DYNAMIC SECTION
AND OUTPUT FILE LISTING


III.  HYPOTHESIS 


Because there exist several data flow architecture
proposals, it is desirable to have a tool with which to
predict the performance of the diverse designs for
comparison purposes. The first part of this research's
hypothesis was that the Petri net-based Requester-Server
(R-S) methodology is such a tool, capable of predicting the
performance of data flow architectures in an efficient,
accurate manner. In effect, the R-S tool was to be tested.

The second part of this research's hypothesis was
concerned with the Dennis-Misunas data flow architecture
design. This design was chosen for two reasons. First, there
existed adequate information in the literature about this
design on which to base an accurate model for simulation
purposes. Second, the Dennis-Misunas design of the basic
instruction execution mechanism is essentially the same as
several other schemes in various stages of implementation
[DENNIS, 1979]. The hypothetical challenge to this design
was that the goal of achieving higher speed computation is
not attainable unless a high and "intelligent" degree of
multiprogramming is realized, as shall be explained next.

Obviously, high speed computation shall require a high
hardware utilization. By this it is meant that most of the
processing elements (PE's) shall have to be performing
useful work most of the time. Such a high hardware
utilization is attainable when either of two situations
occurs. First, a high hardware utilization will result when
a process possessing a large amount of inherent parallelism
is being run (by itself) on the machine. In this case, a
program's execution time is dependent upon its amount of
inherent parallelism and the number of PE's in the machine.
Second, a high hardware utilization is attainable when a
multiprogramming environment (in which several processes are
permitted to simultaneously run on the machine) is
instituted. In such a multiprogramming environment, an
individual process shall be competing for hardware resources
(PE's). Thus, that process' execution time may be lengthy
regardless of its amount of inherent parallelism. This is so
because that process may have the use of only a small
portion of the machine's resources (PE's) at any point in
time. Put another way, if at any time a process has N
instructions available for execution, but there are fewer
than N PE's available for executing those instructions in a
parallel fashion, then the process' execution time shall be
lengthened over what it could be if it had sufficient PE's.

In such a situation, a scheme may be needed to implement
a policy which achieves two objectives:

1. maintaining high hardware utilization and

2. providing an acceptable average response time for
a user requiring a given amount of processing.

"Acceptable average response time" is construed to mean that
the actual response time of any particular program which
requires a given amount of processing shall not be
lengthened considerably over what it would be if the program
were executed by itself on the data flow machine. Thus, it
shall be desirable to minimize the effect of system load on
an individual program's execution time. That the second
objective should be met even at the expense of the first
objective is a strong point made by [KLEINROCK, 1976]. By
merely mapping processes onto the data flow machine as they
arrive, it is expected that objective #1 shall be achieved
but at the expense of objective #2. This situation was
expected to be demonstrated by this research.
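
The tension between the two objectives can be illustrated with
a toy calculation. The sketch below is not the simulation used
in this research; it assumes that every instruction takes one
time step, that a process is described by a "parallelism
profile" listing how many instructions become ready at each
successive dependency level (each entry positive), that a
level must finish before the next begins, and that free PE's
are handed out one at a time in round-robin order starting
with the first process. The function names are hypothetical.

```python
from math import ceil


def run_alone(profile, num_pes):
    """Return (time steps, utilization) for one process run by itself."""
    steps = sum(ceil(n / num_pes) for n in profile)
    utilization = sum(profile) / (num_pes * steps)
    return steps, utilization


def run_mix(profiles, num_pes):
    """Return (finish step per process, utilization) for a program mix."""
    remaining = [list(p) for p in profiles]  # instructions left per level
    level = [0] * len(profiles)              # current level per process
    finish = [0] * len(profiles)
    step = 0
    executed = 0
    while any(level[i] < len(remaining[i]) for i in range(len(profiles))):
        step += 1
        free = num_pes
        active = [i for i in range(len(profiles))
                  if level[i] < len(remaining[i])]
        grants = {i: 0 for i in active}
        # Hand out PE's one at a time to processes with ready work.
        while free > 0:
            takers = [i for i in active
                      if remaining[i][level[i]] - grants[i] > 0]
            if not takers:
                break
            for i in takers:
                if free == 0:
                    break
                grants[i] += 1
                free -= 1
        for i in active:
            remaining[i][level[i]] -= grants[i]
            executed += grants[i]
            if remaining[i][level[i]] == 0:
                level[i] += 1
                if level[i] == len(remaining[i]):
                    finish[i] = step
    return finish, executed / (num_pes * step)
```

On four PE's, a process with profile [4, 1, 4] takes three
steps alone at a utilization of 0.75, and a process with
profile [2, 2] takes two steps alone at a utilization of 0.5.
Run together under the assumptions above, utilization rises to
0.8125, but the first process now needs four steps instead of
three.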

The purpose of this section has been to "frame" the
research area by presenting the issues which give rise to
the hypothesis. The following section presents the method
used to test the hypothesis, and includes a discussion of
the assumptions made to facilitate the simulation
experiments where undecided design issues remain.

IV. METHOD


A.  EXPERIMENTAL  DESIGN 

The experiment to allow prediction of data flow computer
performance involved executing sets of Petri net models of
data flow programs on Petri net models of data flow
hardware. The Requester-Server (R-S) program tool monitored
the data flow (model) programs' "execution" and provided
data which permitted the determination of the performance
indices: response time and hardware utilization. It is
important to realize that the results of this research
predict the performance of a model of a data flow computer,
not that of an operating data flow machine itself. (Model
validation and parameter estimation issues are addressed in
section IV.B: Data Flow Hardware Definition.)

The reader who is familiar with analytic modelling
employing queueing theory may ask why that technique, rather
than the simulation technique, was not used to predict the
performance of the data flow design. The answer is that the
analytic approach unnecessarily constrains the prediction by
requiring assumptions to be made about the software.
Specifically, the Petri net models of software programs are
discretely defined with regard to the amount of inherent
parallelism available for exploitation at each time step in
program execution. To model analytically, the variability of
inherent parallelism available for exploitation must be
described by probability distributions which hide the
definable nature of the programs at discrete time steps.

For this experiment, the data flow architecture was to
be modelled with several different quantities of processing
elements (PE's). The sample data flow program models were to
be characterized by varying but definable amounts of
inherent parallelism available for exploitation. Each
(model) program was to be separately run on each (model)
hardware configuration. (Hereafter, the word "model" shall
be omitted but assumed in referring to the program and
hardware models used in this experiment.) Data was to be
obtained to permit determination of the performance indices
(response time and hardware utilization), for each run, from
the monitor function of the R-S tool. After running each
program separately, arbitrary program mixes were to be run
on each hardware configuration and the same performance
indices again determined. Finally, hand-optimized program
mixes were to be run on each hardware configuration and the
same performance indices determined once again. By
evaluating the results, the hypothesis was expected to be
either supported or refuted.

The independent variable for this experiment was defined
to be the quantity of PE's available to the hardware model.
Because PE's are but one resource demanded by a process in
execution, other independent variable choices could have
included other resources such as: the quantity of cell
blocks available, the type of distribution or arbitration
network employed, and/or the type of PE's (multipurpose or
sets of single-purpose functional units) utilized. Expanding
the number of independent variables increases significantly
the complexity of evaluating the results. How these issues
were resolved is explained in section IV.B.

The dependent variables for this experiment were the
parameters response time and hardware utilization. The
results of the experiment were expected to provide data
which could be plotted on graphs. Curves plotting the
execution time of each data flow program against the number
of PE's would constitute one such graph. Others, and their
significance, are presented in section V: Results and
Discussion.

B.  DATA  FLOW  HARDWARE  DEFINITION 

The Petri net models of the data flow hardware
configurations were quantified in terms of their operation
in time. Such quantification required assigning time
duration values to each portion of the cell block
architecture model in such a way as to closely model the
hardware. Doing so required several assumptions to be made.
Those assumptions shall be addressed individually so as to
help substantiate the credibility of the resultant hardware
models.


To begin, the processing elements (PE's) were assumed to
be multifunctional, capable of executing any instruction
routed to them in one "standard" instruction execution time
unit. Allowing the PE's to be multifunctional and
characterized by a singular execution time simplifies the
modelling process.

There were at least two other possibilities that could
be accommodated by expansion of the Petri net models. First,
each multifunction PE could be replaced by a set of
single-purpose PE's, each single-purpose PE defined in terms
of its particular instruction execution time, and capable of
executing concurrently with other PE's of the set. Second,
each PE could be replaced by a subnet in which only one
instruction could be executed in any given time step, but
the model would define the execution time as a function of
the instruction type.

The  first  alternate  approach  implies  a  more  complex 
arbitration  network  with  a  conceivably  longer  routing  time. 
The  second  alternative  would  require  additional  net 
complexity.  (However,  this  approach  would  be  a  good 
possibility  for  subsequent  research.)  Because  the  actual 
implementation  configuration  has  not  been  finalized, 
modelling  the  PE's  as  multifunctional  and  characterized  by  a 
singular  execution  time  was  a  reasonable  path  to  follow. 

The distribution network design also has not been
finalized. For ease of modelling purposes, a crossbar
switch design capable of supporting simultaneous transfers
of result packets to cell blocks was chosen to be modelled.
This choice permitted a standard routing time to be
characterized by the model. Other network designs,
especially packet routing networks, may be preferred to the
crossbar switch for the ultimate machine because of their
lower cost and comparable performance in a data flow
architecture [DENNIS, 1979].

The  choice  to  model  the  PE's  as  multifunction  units 
precluded  the  need  for  anything  but  a  simple  arbitration 
network.  Such  a  network  would  merely  have  to  route  operation 
packets  to  any  available  PE.  Accordingly,  in  the  model,  a 
standard  routing  time  for  this  network  was  characterized. 

With regard to the cell blocks, the assumption was made
that sufficient cell blocks were available to hold all
portions of all processes being run on the machine at each
and every instant. Thus, there is no notion of paging
portions of processes into and out of memory (the activity
store in the case of the data flow architecture). This
assumption carries with it the assumption that all program
compilation (resulting in extended data flow graph-like
representations) is complete before beginning program
execution. Other compilation strategies are under
consideration, such as requiring the user to interact with
the system to achieve a high degree of parallelism
exploitation [McGRAW, 1980].


As has been described, other hardware choices
(representing independent variables in an experiment) can be
made and easily implemented by simply defining appropriate
subnets which, in a time-wise fashion, characterize the
portions of the hardware under scrutiny. The approach taken
in this research permitted the hardware timing
characteristics to be a function of simply the number of
PE's. Figure IV.B.1 is the Petri net representation (of the
cell block architecture hardware) utilized in this research.
For the purposes of this experiment and in the configuration
described, each PE was assumed to be driven at the rate of
two million floating point operations per second (FLOPS), a
rate claimed to be reasonable by [DENNIS, 1980]. This figure
represents an instruction execution time of 500 nanoseconds
(nsec). (This is represented in the hardware model by
signifying a scaling of each event/transition pair to equal
100 nsec.) Associating timing characteristics with each
component in the data flow architecture design results in a
similar figure, as shown in figure IV.B.2.

PE (instruction execution)                           50 nsec
CELL BLOCK (memory fetch assuming MOS technology)   250 nsec
DISTRIBUTION NETWORK (assuming crossbar switch)     250 nsec
ARBITRATION NETWORK (assuming negligible)
[WEITZMAN, 1980]                             TOTAL: 550 nsec

FIGURE IV.B.2: TIMING CHARACTERISTICS OF DATA FLOW
ARCHITECTURE COMPONENTS
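
The arithmetic behind these figures can be checked with a short
sketch. The Python below is illustrative only and is not part of
the R-S tool. The variable names are assumptions; the numeric
values are those quoted in the text and in figure IV.B.2, and the
arbitration network is entered as zero because the figure treats
it as negligible.

```python
# Timing arithmetic for the hardware model (values quoted in the text).
PE_RATE_FLOPS = 2_000_000  # two million floating point operations per second

# Instruction execution time implied by the PE rate, in nanoseconds.
instruction_time_ns = 1e9 / PE_RATE_FLOPS

# Component times from figure IV.B.2, in nanoseconds. The arbitration
# network is assumed to contribute nothing, as the figure states.
component_ns = {
    "PE": 50,
    "cell block": 250,
    "distribution network": 250,
    "arbitration network": 0,
}
total_ns = sum(component_ns.values())

print(instruction_time_ns)  # 500.0
print(total_ns)             # 550
```

The component total (550 nsec) is close to, though not identical
with, the 500 nsec implied by the assumed PE rate, which is the
sense in which the text calls the two figures similar.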


For the purposes of this research the quantity of PE's in
the hardware was varied from one to sixteen, by factors of
two. This resulted in data generated for the following
quantities of modelled PE's: 1, 2, 4, 8, and 16.

This summarizes the assumptions utilized in developing
the model presented here. The following section describes
the Petri net definition of the data flow software, that is,
the program models which were "executed" on the hardware
models.

C. DATA FLOW SOFTWARE DEFINITION

The Petri net models of data flow programs were
quantified in terms of the amount of inherent parallelism
available for exploitation at each discrete time step as
well as in terms of the implicit data dependencies of the
programs. (As previously mentioned, the data dependencies
define the control flow of a program.) The initial approach
involved taking sample programs written in the high level
language (HLL) VAL and converting them to their equivalent
Petri net representations for subsequent "execution" on the
data flow hardware models. The problem with this approach
was that the compilation process is not yet developed. Thus,
the hardware instructions that would be required for each
HLL instruction were not determinable.

The subsequent approach, which was utilized, involved
designing Petri net program models characterized by various
but discretely definable levels of inherent parallelism.
Executing such artificial programs conceivably produced more
informative results than would have been obtained with a few
select programs which may have only demonstrated data flow's
suitability for those special purpose computations. The
individual programs shall be characterized after introducing
a new concept.

A new concept introduced at this point is that of a
software "concurrency vector". A concurrency vector is a
tuple, each entry of which defines the amount of inherent
parallelism in a program at the operation packet
hierarchical level, at a discrete instruction execution time
step. Each entry of the tuple is implicitly subscripted by
the time step it describes. For example, the simple
statistics function of section II.C (see figure II.C.2, page
32) would be characterized by the concurrency vector
(4,2,2,2,1,1). In this example concurrency vector, the "4"
represents the fact that four operations (including "SQ",
"SQ", and "SQ") could be processed in parallel during the
first time step of execution of the simple statistics
function. This is so because no sequencing constraints exist
among these four operations. Thus the concurrency vector
defines how many operation packets could be processed in
parallel if all the instructions (i.e. functions: addition,
subtraction, division, square, square root) were implemented
in hardware. (If they were not, the subfunctional operation
packets required by the instruction would be considered in
defining the concurrency vector entries.) It should also be
recognized that the concurrency vector, though a function of
a program, is dependent upon a standard instruction
execution time duration. If the hardware is implemented such
that execution time is a function of the instruction type,
then the concurrency vector entries could be described at an
even lower level than the operation packet level. Such a
level would correspond to a basic hardware cycle time, where
executing an instruction would require more than one
hardware cycle to complete. This additional
complexity need not be considered in this research in view
of the hardware design approach taken, but could be
accommodated by the R-S methodology used here.
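
As an illustration of the definition above, a concurrency vector
can be derived from a program's data dependencies by assigning
each operation to the earliest time step at which all of its
operands exist and then counting operations per step. The Python
sketch below is a minimal illustration, not the procedure used in
this research. The function name `concurrency_vector` and the
operation names are hypothetical, and the dependency graph is an
assumed reconstruction of a mean and standard deviation function
of three inputs; it is not copied from figure II.C.2, but was
chosen because it yields the vector quoted in the text.

```python
from collections import Counter

def concurrency_vector(deps):
    """Return the concurrency vector of a dependency graph.

    deps maps each operation name to the list of operations whose
    results it needs. Every operation is assumed to take exactly one
    instruction execution time step and to fire as soon as all of
    its operands are available.
    """
    level = {}

    def step_of(op):
        # An operation fires one step after its latest operand is produced.
        if op not in level:
            level[op] = 1 + max((step_of(d) for d in deps[op]), default=0)
        return level[op]

    counts = Counter(step_of(op) for op in deps)
    return tuple(counts[t] for t in range(1, max(counts) + 1))

# Assumed graph for mean and standard deviation of x, y, z.
stats = {
    "add_xy": [], "sq_x": [], "sq_y": [], "sq_z": [],
    "add_xyz": ["add_xy"],                # (x + y) + z
    "add_sq_xy": ["sq_x", "sq_y"],        # x^2 + y^2
    "mean": ["add_xyz"],                  # sum / 3
    "add_sq_xyz": ["add_sq_xy", "sq_z"],  # x^2 + y^2 + z^2
    "sq_mean": ["mean"],                  # mean^2
    "mean_sq": ["add_sq_xyz"],            # sum of squares / 3
    "variance": ["mean_sq", "sq_mean"],   # subtraction
    "std_dev": ["variance"],              # square root
}

print(concurrency_vector(stats))  # (4, 2, 2, 2, 1, 1)
```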

Four programs ("A" through "D") were utilized in this
research. These programs are differentiable by their length
as well as by the amount of inherent parallelism available
for exploitation at each time step. The Petri net
representations of these programs are shown in figures
IV.C.1 through IV.C.4. Additionally, the concurrency vector
for each is shown. The program mixes for this experiment
included one of each of the three programs "A", "B" and
"D". Another program mix including one of each of the four
programs, "A" through "D", was also used.

The following section presents the procedure utilized in
executing the experiment. Additionally, the method of
mapping the program mixes onto each of the hardware
configurations is explained.

D. PROCEDURE/IMPLEMENTATION

The four procedural steps utilized in executing this
experiment were as follows. First, Petri net models of both
the hardware configurations (figure IV.B.1) and software
programs (figures IV.C.1-.4) were converted to a format
acceptable as input to the Requester-Server (R-S) program.
Two Pascal programs, compiled and executed on the NPS
"B-side" PDP-11 (a UNIX-based system), facilitated the
(separate) generation of the hardware and software portions
of the input files for the R-S program. Each input file was
formed by concatenating the hardware and software portions
and then editing the resulting file to define the dynamic
execution desired. The second step was to transfer each
complete input file from the NPS "B-side" to the NPS
"A-side" PDP-11 (an RSX-11M-based system), via an
inter-processor link. Thirdly, the R-S program was run on
the "A-side", taking as input the file which defined the
hardware, software, and dynamic execution desired. Fourth
and finally, data regarding the execution of the software on
the hardware was obtained from the output file generated by
the R-S program. The results of the data analysis are
presented and discussed in section V (Results and
Discussion).

The implementation portion of this section addresses the
technique used to map the software onto the hardware in such
a way as to effectively simulate this function as it might
be done on a real data flow machine. The first set of
experimental "runs", which consisted of the separate running
of each program on each hardware configuration, was
straightforward to implement. The procedure described
above, in which each program file portion was concatenated
with the appropriate hardware file portion, achieved a
relevant mapping for modelling a single process running on a
particular hardware configuration. The subsequent set of
experimental "runs", in which a program mix was mapped onto
each of the hardware configurations, was not so
straightforward to implement, as shall be explained next.

To understand the mapping of software (i.e. processes)
onto data flow hardware, it is helpful to scrutinize the
functions of the operating system for such a machine.
Because the scheduling and synchronization of concurrent
activities are built in at the hardware level, a data flow
machine's operating system will only be responsible for
initialization, termination, and input/output (I/O) of
processes. Once a process is mapped onto the data flow
machine, it runs to completion without further intervention
by the operating system (except for I/O). The question which
must be answered is: When should another process be mapped
onto a machine which is already executing one or more
processes? Thus, in defining the input file of a program mix
representative of ready processes, the program mix had to be
defined in terms of a mapping function.

The mapping functions can be thought of as operating
system assignment policies. Thus, for those runs involving
program mixes (as opposed to single programs), an assignment
policy had to be simulated. One aspect of this research can
then be viewed as an investigation of different policies for
mapping processes onto data flow machines in a
multiprogramming environment.

Each program mix consisted of the three programs "A",
"B" and "D". (A later "run" for which data was gathered
utilized the four program mix consisting of one each of the
programs "A" through "D".) Each mix was varied in the way in
which it was mapped onto the hardware, in simulating
different operating system mapping functions. The operating
system assignment policies for mapping a program mix onto
the hardware configurations follow. Three policies were
simulated. First, the three programs were permitted to begin
"execution" at the same time. Second, an "80% Rule" was
simulated in which an additional program was permitted to
begin "execution" whenever the hardware utilization dropped
below 80%. Third, an "intelligent" assignment policy was
implemented via a mapping function based on the programs'
concurrency vectors. This assignment policy, it was
envisioned, would cause optimal performance in terms of the
performance indices: response time and hardware utilization.
The concurrency vector approach optimizes the assignment of
processes onto the machine by fitting together concurrency
vectors of ready processes in such a way that the objectives
noted in section III are achieved. For example, given a
machine with eight PE's, the concurrency vectors would be
fitted as shown in figure IV.D.1.


JOB "A"   (4,3,2,1,2,3,4)
JOB "B"   (3,5,7,6,5,2)
JOB "C"   (4,4,4,2,1)
JOB "D"   (...,9,4,4)
JOB "E"   (2,7,6,...)

TOTAL:    (...,3,8,8,8,8,8,6,8,6,8,7,8,...)
          =====TIME=====>


FIGURE IV.D.1: AN EXAMPLE OF "FITTING" CONCURRENCY
VECTORS TOGETHER


By generating the concurrency vectors at compile time, the
program can declare beforehand those resources (as a
function of time) needed for execution as well as when the
program will be completed. An operating system can thus
choose the sequencing of the running of the waiting
processes to achieve the fit that best meets the
objectives of section III.
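
One simple way to "fit" concurrency vectors together is a
first-fit search over start times. The Python sketch below is
offered only as an illustration of the idea under stated
assumptions; it is not the mapping function used in this
research and it does not claim to find an optimal fit. The name
`fit_vectors` is hypothetical. Jobs are taken in the order given,
and each is started at the earliest time step at which its
demand, added to that of the jobs already placed, never exceeds
the number of PE's. The example uses the three complete vectors
of figure IV.D.1.

```python
def fit_vectors(vectors, n_pes):
    """Greedily choose a start step for each concurrency vector.

    Returns (start_steps, combined_load), where combined_load[t] is
    the total number of PE's demanded at time step t.
    """
    load = []    # combined PE demand per time step
    starts = []
    for cv in vectors:
        if max(cv) > n_pes:
            raise ValueError("job needs more PE's than the machine has")
        start = 0
        # Slide the job later until no time step would be overloaded.
        while any(
            (load[start + i] if start + i < len(load) else 0) + c > n_pes
            for i, c in enumerate(cv)
        ):
            start += 1
        # Commit the job: extend the load profile and add its demand.
        load.extend([0] * (start + len(cv) - len(load)))
        for i, c in enumerate(cv):
            load[start + i] += c
        starts.append(start)
    return starts, load

jobs = [(4, 3, 2, 1, 2, 3, 4), (3, 5, 7, 6, 5, 2), (4, 4, 4, 2, 1)]
starts, load = fit_vectors(jobs, n_pes=8)
print(starts)  # [0, 1, 7]
print(load)    # [4, 6, 7, 8, 8, 8, 6, 4, 4, 4, 2, 1]
```

Because the search is greedy and order dependent, a different
job ordering can give a tighter fit; choosing that ordering is
the sequencing decision left to the operating system.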

The  results  of  these  experimental  runs  are  presented  in 
the  following  section  in  graphical  form.  Additionally,  the 
meaning  and  significance  of  the  results  are  discussed. 


V.  RESULTS  AND  DISCUSSION 


In response to the first part of this research's
hypothesis, it is proposed that the Petri net-based
Requester-Server (R-S) methodology is indeed a desirable
tool with which to predict the performance of the diverse
designs of data flow architectures. The ability to
separately specify the hardware, software and resource
allocation policy was an R-S feature which permitted
efficient generation of the combinations of the above three
items. The ability to easily implement variable levels of
detail in both the hardware and software was not exploited
but the method for doing so was introduced. Finally, the R-S
methodology's capability of simulating concurrency and
asynchronous behavior in both the hardware and software is a
necessity for accurately modelling and simulating data flow
computing.

The results which address the second part of this
research's hypothesis are now presented. To begin, figure
V.1 shows individual program execution times as a function
of the number of processing elements (PE's). These absolute
execution times were used as a basis for comparison with the
results from the multiprogramming environment runs. Percent
hardware utilization is displayed adjacent to each data
point. The hardware utilization values are averages of the
hardware utilizations at each time step during execution.
The graph shows that a program's execution time can be
drastically reduced by increasing the number of processing
resources (PE's) available up to the point where execution
time is bounded by the amount of inherent parallelism
available for exploitation in the program.
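
The bound described above can be illustrated with a rough
estimate computed from a concurrency vector alone. The Python
sketch below is not the R-S simulation and does not reproduce the
measured data. It assumes that all operations of one vector entry
must finish before the next entry begins and that each PE
executes one operation per time step. The name `run_alone` is
hypothetical, and the vector used is the example quoted in
section IV.C.

```python
from math import ceil

def run_alone(cv, n_pes):
    """Estimate time steps and average utilization for one program.

    An entry with c operations occupies ceil(c / n_pes) steps.
    Average utilization is the mean, over all steps, of the fraction
    of PE's busy, which equals total operations / (n_pes * steps).
    """
    steps = sum(ceil(c / n_pes) for c in cv)
    utilization = sum(cv) / (n_pes * steps)
    return steps, utilization

cv = (4, 2, 2, 2, 1, 1)
for n in (1, 2, 4, 8, 16):
    steps, util = run_alone(cv, n)
    print(n, steps, round(util, 3))
```

Under these assumptions the step count stops falling once the
number of PE's reaches the largest entry of the vector (four in
this example), while utilization continues to drop as further
PE's are added.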

The following results pertain to the running of the
program mixes in a simulated multiprogramming environment.
Initially, it was intended that program mixes would contain
a greater quantity of programs than were actually run. This
was not achieved due to time constraints. Accordingly, the
results should be considered preliminary in nature. On a
positive note, the results provide insight into several data
flow operation issues. Figure V.2 provides the raw data for
this research with the exception of the data utilized in
computing hardware utilization. Figures V.3, V.4, and V.5
present the hardware utilization (as a function of time) for
the 4-, 8-, and 16-PE configurations. Similar graphs for the
1- and 2-PE configurations are presented in figure V.6 (note
the identical nature).

The implications supported by this data follow. They are
not definitive because of the small quantity of programs and
program mixes in the model. Thus, while the second part of
the hypothesis may not have been adequately tested, the
methodology for doing so appears to be available in the R-S
tool. Additionally, it is maintained that the initial
results support the discussion which follows.

Whenever the amount of concurrency in all running
processes exceeds the number of PE's available to meet the
requirements of the processes, some slowdown in execution
time results for some processes. The dual of this result is
that, so long as there are adequate PE's available, no
slowdown in any process' execution time results.
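
This behavior can be reproduced in a toy model of a program mix.
The Python sketch below is an assumption-laden illustration, not
the R-S program, and its numbers are not the results of this
research. The name `simulate` and its `threshold` parameter are
hypothetical. Each PE executes one operation per time step, PE's
are granted to programs in list order, and a program advances to
its next concurrency vector entry only when the current entry is
finished. With `threshold=None` all programs begin together; with
`threshold=0.8` one further program is admitted after any step
whose utilization fell below 80%. The sample mix reuses the three
complete vectors of figure IV.D.1 rather than programs "A"
through "D".

```python
def simulate(vectors, n_pes, threshold=None):
    """Return (finish_steps, average_utilization) for a program mix."""
    waiting = list(range(len(vectors)))
    running = {}                 # program index -> [entry, ops left]
    finish = [None] * len(vectors)
    used_total = 0
    step = 0
    last_util = 0.0
    while waiting or running:
        # Admission decision for this time step.
        if threshold is None:
            admit = len(waiting)
        elif waiting and (not running or last_util < threshold):
            admit = 1
        else:
            admit = 0
        for _ in range(admit):
            p = waiting.pop(0)
            running[p] = [0, vectors[p][0]]
        step += 1
        free = n_pes
        for p in sorted(running):
            state = running[p]
            grant = min(free, state[1])
            free -= grant
            state[1] -= grant
            if state[1] == 0:            # current entry finished
                state[0] += 1
                if state[0] == len(vectors[p]):
                    finish[p] = step
                    del running[p]
                else:
                    state[1] = vectors[p][state[0]]
        used_total += n_pes - free
        last_util = (n_pes - free) / n_pes
    return finish, used_total / (n_pes * step)

mix = [(4, 3, 2, 1, 2, 3, 4), (3, 5, 7, 6, 5, 2), (4, 4, 4, 2, 1)]
print(simulate(mix, 16))                # adequate PE's, all begin together
print(simulate(mix, 8))                 # too few PE's, all begin together
print(simulate(mix, 8, threshold=0.8))  # 80% rule
```

With 16 PE's every program finishes in as many steps as its
vector has entries; with 8 PE's at least one program finishes
later than that, which is the slowdown described above.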

Under the "All Begin Together" (ABT) assignment policy,
the data flow hardware becomes "overloaded", resulting in
the slowdown just described. For example, program "A",
though the first to begin processing under the ABT scheme,
is the last to finish under the three program mix. Under the
four program mix with 16 PE's, the "C" program takes longer
only because of its great length and inherent parallelism.

Under the "80% Rule" assignment policy, the average
hardware utilization is lower than that under the ABT policy
(for the three program mix). Also, the average of the three
programs' execution times is lower than that under the ABT
policy. For the four program mix, the average hardware
utilization is slightly greater. This reflects a better
mapping of processes onto the machine. (Average hardware
utilization is defined as the average, over the duration of
a run, of the hardware utilization percentages at each time
step of that run.)

Under the optimized concurrency vector (CV) approach,
programs were mapped onto the hardware configurations in
such a way as to achieve a high hardware utilization at each
time step as well as minimize the average response time of
the programs in the mix. The results indicate average
hardware utilizations at least as high as under either of
the other assignment policies. Also, the average response
times were at least as low as under either of the other
assignment policies. This optimized concurrency vector
approach should be suitable for machine optimization. Using
concurrency vectors generated at compile time, the mapping
of additional processes onto the data flow machine should
probably continue only so long as acceptable average
response time for any process is not exceeded. When a
process characterized by more inherent parallelism than can
be currently accommodated on the data flow machine is
awaiting assignment (i.e. mapping onto the data flow
machine), that process' assignment should be delayed until
sufficient (or, if necessary, all) PE's are available to
parallel process the computation. (A user advisory denoting
such a delay would be highly desirable.)

The time spent in preprocessing jobs in accordance with
any assignment scheme to achieve some level of optimization
may be unnecessary or even wasteful. This trade-off will
have to be examined in greater depth.


VI. SUMMARY


Following a review of the pertinent literature, a
two-part hypothesis was proposed. First, the Petri net-based
Requester-Server (R-S) methodology's suitability for
predicting the performance of data flow machines was to be
tested. Second, it was hypothesized that the goal of
economically achieving higher speed computation through data
flow computing would be unattainable without achieving a
high and intelligent degree of multiprogramming. The R-S
methodology, a simulation technique, permits the separate
specification of the hardware to be evaluated, the software
to be used in the hardware evaluation, and the policy for
allocating hardware resources to program requests for
service. Accordingly, Petri net models of data flow hardware
configurations were quantified in terms of their execution
in time, and Petri net models of data flow programs were
quantified in terms of the amount of inherent parallelism
available for exploitation at each discrete time step, as
well as in terms of the implicit data dependencies of the
program. Model programs were "run" on model hardware
configurations. Results obtained from the monitor function
of the R-S program were analyzed with respect to the
performance indices: hardware utilization and response time.

Three assignment policies for determining when to map
additional programs onto a data flow machine were tested:

1. all programs begin together

2. assign an additional program whenever the
hardware utilization drops below 80%

3. assign an additional program based on a
concurrency vector.

Results show that the R-S methodology is indeed an
efficient and easy-to-use tool for investigating data flow
architectures. Also, initial results indicate that optimized
scheduling based upon concurrency vectors is viable for
deciding when to map additional processes onto a data flow
machine to achieve the objectives of maintaining high
hardware utilization and providing acceptable average
response time.


VII. RECOMMENDATIONS FOR FURTHER RESEARCH


With regard to the methodology, worthwhile additions to
the R-S program would be user-friendly "front-" and
"back-ends" which would further simplify both the generation
of input files for the R-S tool and the retrieval of desired
data from the output file generated by each run.

In the area of data flow, simulations in which the
hardware definition of the architecture was varied (as
described in section IV.B) could provide insights regarding
the optimal hardware configuration for the expected program
load. In particular, the quantity of (modelled) PE's should
be increased to a number closer to the amount expected in
the actual machine (approximately 512). Of course, in order
to model more accurately, the expected load in terms of the
quantity of programs and typical amounts of inherent
parallelism shall have to be defined more exactly.

A final open area within data flow research is the
development and testing of specific algorithms using
concurrency vectors to permit machine optimization and
implementation of a desirable assignment policy.